Life Expectancy Model

Introduction

In this report we work with data compiled by the WHO on the life expectancy of countries across the world. Life expectancy has been steadily increasing over the past few decades as technology and countries advance, but by analyzing this data we can get a much more granular idea of exactly which factors affect a country's life expectancy.

Dataset being used: https://www.kaggle.com/kumarajarshi/life-expectancy-who

Method

Data Cleaning

library(readr)
lifexdata <-  read.csv("Life Expectancy Data.csv")
df <- data.frame(lifexdata)
df <- na.omit(df) # if we don't do this, the models shown below will have different numbers of rows, preventing us from comparing them directly through ANOVA and other evaluation criteria

We can see the variables we are working with: Country, Year, Status, Life.expectancy, Adult.Mortality, infant.deaths, Alcohol, percentage.expenditure, Hepatitis.B, Measles, BMI, under.five.deaths, Polio, Total.expenditure, Diphtheria, HIV.AIDS, GDP, Population, thinness..1.19.years, thinness.5.9.years, Income.composition.of.resources, Schooling

  • Number of observations: 2938
  • Number of variables : 22
  • Response variable: Life expectancy (Type: Numeric)
  • Potential categorical predictor variables: Country, Status, Year
  • Potential numeric predictor variables: Adult Mortality, infant deaths, Alcohol, percentage expenditure, Hepatitis B, Measles, BMI, under-five deaths, Polio, Total expenditure, Diphtheria, HIV/AIDS, GDP, Population, thinness 1-19 years, thinness 5-9 years, Income composition of resources, Schooling

Factor Variables

We have 3 contenders to be Factor Variables:

Predictor   No. of Levels
Country     193
Status      2
Year        16

We can immediately eliminate Country from our data frame: its 193 levels would make our model far too large and unwieldy. Year, with 16 levels, poses a similar problem, so we shall use Status as our sole factor variable.

drop <- c("Country", "Year")
df <- df[,!(names(df) %in% drop)]
df$Status <- as.factor(df$Status)

Creation of Meaningful Country Variable: Per-Capita GDP

One major characteristic of countries is their GDP, a reflection of the extent to which they produce wealth. Total GDP, however, does not account for population: a country with a large population can have a high GDP yet relatively low per-person output. One would expect GDP per capita to be a major influencing factor, for the simple reason that rich countries should presumably be better able to maintain a healthy population than poor ones. Accordingly, we create a new variable, per_cap_GDP, to capture this possibility. Because its values are tiny and span several orders of magnitude, we work with log(per_cap_GDP) instead.

df$per_cap_GDP <- df$GDP / df$Population # values too small
df$log_per_cap_GDP <- log(df$per_cap_GDP) # better scale
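Since log() is undefined for zero and negative arguments, it is worth verifying that the transform introduced no non-finite values. This is a defensive check we add here; it is not part of the original pipeline:

```r
# Count non-finite entries (NaN, Inf, -Inf) produced by the log transform
sum(!is.finite(df$log_per_cap_GDP))
```

A result of 0 means every per-capita GDP value was strictly positive.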

Training and Testing Sets

Now that we have finished cleaning our full data set, we will create our Training and Test data, utilizing the recommended 80-20 Training-Test split:

trn <- sort(sample(nrow(df), nrow(df) * 0.8)) # note: without set.seed(), this split is not reproducible
train_df <- df[trn,]
test_df <- df[-trn,]

Division of Variables into Categories

An obvious question to start with is whether there is significant collinearity among the variables, and one would expect a pairs plot to give quick insight into the matter. However, with so many variables, applying pairs() to all the numeric variables at once produces an unreadable plot. We have therefore chosen to draw pairs plots for thematic categories of information, dividing the data into four categories: weight, disease, money, and lifestyle.

Money

money <- subset(train_df, select = c(Life.expectancy, percentage.expenditure, log_per_cap_GDP, Income.composition.of.resources, GDP))
pairs(money, col = "darkgreen", main = "Economic Factors and Life Expectancy")

Some of the pairs plots are a little unusual. For example, log_per_cap_GDP vs Income.composition.of.resources exhibits one large blob of data on the right and a completely separate vertical line on the left (the respective positions are top and bottom if log_per_cap_GDP is on the x-axis). It is as if some other interacting factor were causing two distinct relationships between these variables. Income.composition.of.resources exhibits the same divided behavior when plotted against Life.expectancy.

If we check for correlations numerically, we find the following:

(cor_money <- round(cor(money), 2))
##                                 Life.expectancy percentage.expenditure
## Life.expectancy                            1.00                   0.41
## percentage.expenditure                     0.41                   1.00
## log_per_cap_GDP                            0.37                   0.30
## Income.composition.of.resources            0.72                   0.41
## GDP                                        0.45                   0.95
##                                 log_per_cap_GDP Income.composition.of.resources
## Life.expectancy                            0.37                            0.72
## percentage.expenditure                     0.30                            0.41
## log_per_cap_GDP                            1.00                            0.32
## Income.composition.of.resources            0.32                            1.00
## GDP                                        0.36                            0.46
##                                  GDP
## Life.expectancy                 0.45
## percentage.expenditure          0.95
## log_per_cap_GDP                 0.36
## Income.composition.of.resources 0.46
## GDP                             1.00

There are some highly correlated predictors, such as GDP and percentage.expenditure (the correlation is 0.95). Clearly one of those two variables should be eliminated. Since we have already expressed GDP in terms of a transformed variable (log_per_cap_GDP), it makes more sense to eliminate GDP.
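The same conclusion can be reached programmatically from the correlation matrix computed above; a small sketch (the 0.9 cutoff is our own choice):

```r
# Flag predictor pairs whose absolute correlation exceeds 0.9
high <- which(abs(cor_money) > 0.9 & upper.tri(cor_money), arr.ind = TRUE)
data.frame(var1 = rownames(cor_money)[high[, 1]],
           var2 = colnames(cor_money)[high[, 2]],
           r    = cor_money[high])
```

With the matrix above, only the GDP / percentage.expenditure pair is flagged.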

Disease

diseases <- subset(train_df, select = c(Life.expectancy, Hepatitis.B, Measles, Polio, Diphtheria, HIV.AIDS))
pairs(diseases, col = "darkred", main = "Pair Plots for Diseases and Life Expectancy")

If we examine pairs plots for variables related to disease, we see the same strange phenomenon of multiple relationships between a single pair of variables. For example, Polio and Diphtheria exhibit not one but three collinear trends. When plotted against Hepatitis.B or life expectancy, both of those diseases show the same kind of "line on left, blob on right" relationship that we saw earlier for log_per_cap_GDP vs Income.composition.of.resources.

Checking for correlations numerically, we find the following.

(cor_diseases <- round(cor(diseases), 2))
##                 Life.expectancy Hepatitis.B Measles Polio Diphtheria HIV.AIDS
## Life.expectancy            1.00        0.21   -0.08  0.33       0.35    -0.59
## Hepatitis.B                0.21        1.00   -0.14  0.48       0.61    -0.10
## Measles                   -0.08       -0.14    1.00 -0.07      -0.06     0.00
## Polio                      0.33        0.48   -0.07  1.00       0.62    -0.11
## Diphtheria                 0.35        0.61   -0.06  0.62       1.00    -0.13
## HIV.AIDS                  -0.59       -0.10    0.00 -0.11      -0.13     1.00

In this case, there are no obvious candidates for removal, as the correlations tend to be small, and in fact, frequently slightly negative.

Complex Relationships

As for the curious behavior where Polio and Diphtheria exhibit three collinear relationships apparently governed by some interacting variable, it is unclear to us what might be causing it. Three obvious candidates for the interacting variable would be Status, Year, and log_per_cap_GDP, but none of these pan out.

par(mfrow = c(1,3))
colors <- as.numeric(as.factor(df$Status)) + 1
plot(df$Polio ~ df$Diphtheria, col = colors, xlab = "Diphtheria", ylab = "Polio",
     main = "Colored by Status")
colors <- as.numeric(as.factor(df$Year)) + 1
plot(df$Polio ~ df$Diphtheria, col = colors, xlab = "Diphtheria", ylab = "Polio",
     main = "Colored by Year")
colors <- as.numeric(trunc(df$log_per_cap_GDP) * -1) + 5 # shift to achieve positive color values
plot(df$Polio ~ df$Diphtheria, col = colors, xlab = "Diphtheria", ylab = "Polio",
     main = "Colored by Log Per-Cap GDP")

None of these plots shows a clear one-to-one relationship between the color (indicating Status, Year, or log_per_cap_GDP) and the three linear trends of differing slopes. We leave this perplexing phenomenon unexplained, but are confident it will not play a major role in either our explanatory or predictive models below. As mentioned earlier, other pairs exhibited similar "dual-relationship" patterns (especially those with a line on the left and a blob on the right). Explaining these complex relationships remains outside the scope of this study.

Weight

weight <- subset(train_df, select = c(Life.expectancy, BMI, thinness..1.19.years, thinness.5.9.years))
pairs(weight, col = "darkblue", main = "Pair Plots for Weight Indices and Life Expectancy")

When plotted against life expectancy, BMI shows a vertical line on the left and a blob on the right. As in the case of Polio vs Diphtheria, this suggests a hidden interacting variable, in this case one that produces two separate, approximately linear relationships.

There also appears to be strong collinearity between the two thinness variables. We can check the degree of collinearity numerically as follows.

(cors_weight <- round(cor(weight), 2))
##                      Life.expectancy   BMI thinness..1.19.years
## Life.expectancy                 1.00  0.56                -0.47
## BMI                             0.56  1.00                -0.55
## thinness..1.19.years           -0.47 -0.55                 1.00
## thinness.5.9.years             -0.47 -0.55                 0.92
##                      thinness.5.9.years
## Life.expectancy                   -0.47
## BMI                               -0.55
## thinness..1.19.years               0.92
## thinness.5.9.years                 1.00

As expected, the correlation between thinness..1.19.years and thinness.5.9.years is extremely high: 0.92. One of these should be dropped from the weight model. Since the 1-19 age range encompasses the 5-9 range, we choose to keep the former.

Lifestyle

Finally, there are some miscellaneous variables that do not fall neatly into one category. They cover a miscellany of topics about life and death, including alcohol consumption, adult mortality, infant and child deaths, schooling, and population. One could call this category "lifestyle and death", but we will refer to it as "lifestyle" for short. As above, these variables are plotted against life expectancy.

lifestyle <- subset(train_df, select = c(Life.expectancy, Alcohol, Adult.Mortality, infant.deaths, under.five.deaths, Schooling, Population, Status))
pairs(lifestyle, col = "cadetblue4", main = "Lifestyle, Schooling, Population, and Death")

From the plots, we suspect nearly perfect collinearity between infant.deaths and under.five.deaths. A check of the correlations confirms this.

lifestyle_numeric <- subset(train_df, select = c(Life.expectancy, Alcohol, Adult.Mortality, infant.deaths, under.five.deaths, Schooling, Population))
(cor_lifestyle <- round(cor(lifestyle_numeric), 2))
##                   Life.expectancy Alcohol Adult.Mortality infant.deaths
## Life.expectancy              1.00    0.41           -0.71         -0.18
## Alcohol                      0.41    1.00           -0.19         -0.10
## Adult.Mortality             -0.71   -0.19            1.00          0.04
## infant.deaths               -0.18   -0.10            0.04          1.00
## under.five.deaths           -0.20   -0.10            0.05          1.00
## Schooling                    0.73    0.61           -0.42         -0.22
## Population                  -0.03   -0.03           -0.02          0.70
##                   under.five.deaths Schooling Population
## Life.expectancy               -0.20      0.73      -0.03
## Alcohol                       -0.10      0.61      -0.03
## Adult.Mortality                0.05     -0.42      -0.02
## infant.deaths                  1.00     -0.22       0.70
## under.five.deaths              1.00     -0.23       0.69
## Schooling                     -0.23      1.00      -0.05
## Population                     0.69     -0.05       1.00

Here we see some strong correlations between predictors and life expectancy: adult mortality has a strong negative correlation, and schooling a strong positive one. As for collinearity, the correlation of 1.00 between infant.deaths and under.five.deaths confirms our suspicion from the plots (both are also fairly strongly correlated with Population); we will address that when reducing the model. For now we keep these predictors within our scope of consideration.

Preliminary Thematic Models

We can create temporary models based on these themes to get a rough idea of whether these categories of variables exhibit trends with respect to life expectancy.

(discols <- colnames(diseases))
## [1] "Life.expectancy" "Hepatitis.B"     "Measles"         "Polio"          
## [5] "Diphtheria"      "HIV.AIDS"
(moncols <- colnames(money))
## [1] "Life.expectancy"                 "percentage.expenditure"         
## [3] "log_per_cap_GDP"                 "Income.composition.of.resources"
## [5] "GDP"
(wtcols <- colnames(weight))
## [1] "Life.expectancy"      "BMI"                  "thinness..1.19.years"
## [4] "thinness.5.9.years"
(lscols <- colnames(lifestyle))
## [1] "Life.expectancy"   "Alcohol"           "Adult.Mortality"  
## [4] "infant.deaths"     "under.five.deaths" "Schooling"        
## [7] "Population"        "Status"
money_model <- lm(Life.expectancy ~ percentage.expenditure + log_per_cap_GDP + Income.composition.of.resources + GDP, data = train_df)
disease_model <- lm(Life.expectancy ~ Hepatitis.B + Measles + Polio + Diphtheria + HIV.AIDS, data = train_df)
weight_model <- lm(Life.expectancy ~ BMI + thinness..1.19.years + thinness.5.9.years, data = train_df)
lifestyle_model <- lm(Life.expectancy ~ Alcohol + Adult.Mortality + infant.deaths + under.five.deaths + Schooling + Status, data = train_df)

Money Model

(s_money <- summary(money_model))
## 
## Call:
## lm(formula = Life.expectancy ~ percentage.expenditure + log_per_cap_GDP + 
##     Income.composition.of.resources + GDP, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.761  -2.848   0.578   3.251  26.906 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.21e+01   7.90e-01   65.96  < 2e-16 ***
## percentage.expenditure          4.58e-04   3.11e-04    1.47     0.14    
## log_per_cap_GDP                 3.70e-01   5.44e-02    6.79  1.7e-11 ***
## Income.composition.of.resources 3.04e+01   1.01e+00   30.02  < 2e-16 ***
## GDP                             1.62e-05   5.03e-05    0.32     0.75    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.91 on 1314 degrees of freedom
## Multiple R-squared:  0.558,  Adjusted R-squared:  0.557 
## F-statistic:  415 on 4 and 1314 DF,  p-value: <2e-16

When we look at monetary factors, log_per_cap_GDP is highly significant, as is Income.composition.of.resources. GDP and percentage.expenditure are not; we may want to eliminate them from further consideration.

Disease Model

(s_disease <- summary(disease_model))
## 
## Call:
## lm(formula = Life.expectancy ~ Hepatitis.B + Measles + Polio + 
##     Diphtheria + HIV.AIDS, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.328  -4.936   0.498   4.225  17.775 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60.1682329  0.8187009   73.49  < 2e-16 ***
## Hepatitis.B -0.0165468  0.0090506   -1.83   0.0677 .  
## Measles     -0.0000511  0.0000177   -2.89   0.0039 ** 
## Polio        0.0648138  0.0104955    6.18  8.8e-10 ***
## Diphtheria   0.0809831  0.0117728    6.88  9.3e-12 ***
## HIV.AIDS    -0.8244807  0.0311605  -26.46  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.63 on 1313 degrees of freedom
## Multiple R-squared:  0.444,  Adjusted R-squared:  0.442 
## F-statistic:  210 on 5 and 1313 DF,  p-value: <2e-16

At a glance, Measles, Polio, Diphtheria, and HIV.AIDS all have highly significant effects on life expectancy. For some reason, Hepatitis.B does not reach statistical significance.

Weight Model

(s_weight <- summary(weight_model))
## 
## Call:
## lm(formula = Life.expectancy ~ BMI + thinness..1.19.years + thinness.5.9.years, 
##     data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.265  -4.473   0.533   4.595  23.749 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           64.2041     0.6684   96.06   <2e-16 ***
## BMI                    0.1931     0.0121   16.00   <2e-16 ***
## thinness..1.19.years  -0.2729     0.1116   -2.45    0.015 *  
## thinness.5.9.years    -0.1898     0.1104   -1.72    0.086 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.14 on 1315 degrees of freedom
## Multiple R-squared:  0.355,  Adjusted R-squared:  0.353 
## F-statistic:  241 on 3 and 1315 DF,  p-value: <2e-16

Finally, it appears that thinness..1.19.years is an important factor, as is BMI, but thinness.5.9.years is not significant at the 0.05 level.

Lifestyle Model

(s_lifestyle <- summary(lifestyle_model))
## 
## Call:
## lm(formula = Life.expectancy ~ Alcohol + Adult.Mortality + infant.deaths + 
##     under.five.deaths + Schooling + Status, data = train_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.87  -2.33   0.39   2.82  13.63 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       58.52329    0.95268   61.43  < 2e-16 ***
## Alcohol           -0.03344    0.04406   -0.76     0.45    
## Adult.Mortality   -0.03222    0.00113  -28.54  < 2e-16 ***
## infant.deaths      0.09922    0.01289    7.70  2.7e-14 ***
## under.five.deaths -0.07600    0.00954   -7.97  3.5e-15 ***
## Schooling          1.49480    0.06276   23.82  < 2e-16 ***
## StatusDeveloping  -1.86485    0.45235   -4.12  4.0e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.5 on 1312 degrees of freedom
## Multiple R-squared:  0.744,  Adjusted R-squared:  0.743 
## F-statistic:  636 on 6 and 1312 DF,  p-value: <2e-16

Out of the lifestyle variables, Alcohol is a weak predictor, with a high p-value; we may want to jettison that variable. Also, as we noted earlier, infant.deaths and under.five.deaths are practically the same thing. "Under five" seems to be the more inclusive measure (it presumably includes infants), so we discard infant.deaths.

Reduced Models: Eliminating Collinear Predictors

So far it looks as though we should eliminate percentage.expenditure and GDP from the money model; Hepatitis.B from the disease model; thinness.5.9.years from the weight model; and Alcohol and infant.deaths from the lifestyle model.

Let's create new models without these unhelpful predictors and see if they end up being better.

Money

money_reduced <- lm(Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources, data = train_df)
s_money_reduced <- summary(money_reduced)
orig_r2 <- s_money$adj.r.squared
red_r2 <- s_money_reduced$adj.r.squared
Model            Adj. \(R^2\)   Predictors
Original Model   0.5566         4
Reduced Model    0.5473         2

Here we can see there is a 1.6667% decrease in Adjusted \(R^2\) when reducing the size of the model by 2 predictors.
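The percentage changes quoted in this section are computed directly from the stored summaries; for the money pair, for example:

```r
# Percent change in adjusted R^2 going from the original to the reduced model
round((red_r2 - orig_r2) / orig_r2 * 100, 4)
```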

Disease

disease_reduced <- lm(Life.expectancy ~ Measles + Polio + Diphtheria + HIV.AIDS, data = train_df)
s_disease_reduced <- summary(disease_reduced)
orig_r2 <- s_disease$adj.r.squared
red_r2 <- s_disease_reduced$adj.r.squared
Model            Adj. \(R^2\)   Predictors
Original Model   0.4422         5
Reduced Model    0.4412         4

Here we can see there is a 0.2249% decrease in Adjusted \(R^2\) when decreasing the size of the model by 1 predictor.

Weight

weight_reduced <- lm(Life.expectancy ~ BMI + thinness..1.19.years, data = train_df)
s_weight_reduced <- summary(weight_reduced)
orig_r2 <- s_weight$adj.r.squared
red_r2 <- s_weight_reduced$adj.r.squared
Model            Adj. \(R^2\)   Predictors
Original Model   0.3533         3
Reduced Model    0.3524         2

Downsizing the weight model by 1 predictor results in a 0.2718% decrease in Adjusted \(R^2\).

Lifestyle

lifestyle_reduced <- lm(Life.expectancy ~ Adult.Mortality + under.five.deaths + Schooling + Population + Status, data = train_df)
s_lifestyle_reduced <- summary(lifestyle_reduced)
orig_r2 <- s_lifestyle$adj.r.squared
red_r2 <- s_lifestyle_reduced$adj.r.squared
Model            Adj. \(R^2\)   Predictors
Original Model   0.7431         6
Reduced Model    0.732          5

Here we can see there is a 1.4886% decrease in Adjusted \(R^2\) while decreasing the size of the model by 1 predictor.

In all four cases, when assessed by adjusted \(R^2\), the reduced models suffer very little relative to their fuller counterparts. We therefore appear justified in using the smaller models below.
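Where a reduced model is nested in its fuller counterpart (as with money, where the reduced model simply drops two predictors), the comparison can also be made formally with a partial F-test; a sketch for the money pair:

```r
# Nested-model F-test: do percentage.expenditure and GDP add explanatory power?
anova(money_reduced, money_model)
```

A large p-value here would support dropping the two predictors.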

Assumptions Tests

Breusch-Pagan Test for Heteroscedasticity

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bp_money_red <- bptest(money_reduced)$p.value
bp_disease_red <-bptest(disease_reduced)$p.value
bp_weight_red <-bptest(weight_reduced)$p.value
bp_lifestyle_red <-bptest(lifestyle_reduced)$p.value
Reduced Model   Breusch-Pagan P-Value
Money           \(2.2002 \times 10^{-83}\)
Disease         \(1.9498 \times 10^{-12}\)
Weight          \(2.1718 \times 10^{-38}\)
Lifestyle       \(1.5688 \times 10^{-23}\)

Seeing that all of our p-values for the Breusch-Pagan test are extremely low, we have to conclude that all our models fail the Equal Variance assumption. We can confirm this by looking at the residual-vs-fitted graphs.

par(mfrow = c(2,2))
#Money Model
plot(fitted(money_reduced), resid(money_reduced),  col = "darkgreen",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Reduced Money Model")
abline(h = 0, lwd = 3, col = "goldenrod")
#Disease Model
plot(fitted(disease_reduced), resid(disease_reduced),  col = "darkred",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Reduced Disease Model")
abline(h = 0, lwd = 3, col = "black")
#Weight Model
plot(fitted(weight_reduced), resid(weight_reduced),  col = "darkblue",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Reduced Weight Model")
abline(h = 0, lwd = 3, col = "purple")
#Lifestyle Model
plot(fitted(lifestyle_reduced), resid(lifestyle_reduced),  col = "cadetblue4",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Reduced Lifestyle Model")
abline(h = 0, lwd = 3, col = "brown4")

Observations:

  • For the disease model, it's clear that the variance increases as the fitted values increase, rather than staying within the same range.
  • In the case of money, the fitted-vs-residuals plot is problematic: at the lower end of the fitted values, it is as if an entirely different association were taking place. Linearity is violated for low values (nearly all of those residuals lie above 0). Variance does not look terrible outside the lower range, but it is certainly not perfect. The strange behavior at low fitted values again suggests an interaction of some kind; perhaps certain countries, or countries with a certain level of per-capita GDP, operate under an entirely different model.
  • For weight, the plot looks OK in the middle range, but at the outer ends we are clearly not getting a linear response, with the residuals all above 0 for low fitted values and all below 0 for high fitted values.
  • For lifestyle, the fitted-vs-residuals plot is not terrible, though the spread is a bit suspect in the middle range (especially around 70), and the values appear more heavily weighted below the 0 line.

Shapiro-Wilk Test

Reduced Model   Shapiro-Wilk P-Value
Money           \(1.4892 \times 10^{-22}\)
Disease         0.0007
Weight          \(1.7462 \times 10^{-8}\)
Lifestyle       \(2.4965 \times 10^{-25}\)

Seeing that all of our p-values for the Shapiro-Wilk test are also extremely low, we have to conclude that all our models fail the Normality assumption. We can check the QQ plots to be sure.
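The p-values in the table above can be reproduced by applying shapiro.test() to each model's residuals; presumably that is how they were computed, e.g.:

```r
# Shapiro-Wilk p-values for the residuals of each reduced model
shapiro.test(resid(money_reduced))$p.value
shapiro.test(resid(disease_reduced))$p.value
shapiro.test(resid(weight_reduced))$p.value
shapiro.test(resid(lifestyle_reduced))$p.value
```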

par(mfrow = c(2,2))
#Money Model
qqnorm(resid(money_reduced), main = "Reduced Money Model", col = "darkgreen")
qqline(resid(money_reduced))
#Disease Model
qqnorm(resid(disease_reduced), main = "Reduced Disease Model", col = "darkred")
qqline(resid(disease_reduced))
#Weight Model
qqnorm(resid(weight_reduced), main = "Reduced Weight Model", col = "darkblue")
qqline(resid(weight_reduced))
#Lifestyle Model
qqnorm(resid(lifestyle_reduced), main = "Reduced Lifestyle Model", col = "cadetblue4")
qqline(resid(lifestyle_reduced))

Observations:

  • The QQ plot for money looks terrible: normality of the errors is clearly violated outside the middle quantiles.
  • In the case of disease, the QQ plot is almost ideal except at the very extremes.
  • The QQ plot for weight is also almost ideal, except for the lower quantile.
  • The QQ plot for lifestyle also fails badly at the lower quantiles.

Variance Inflation Factor

car::vif(money_reduced)
##                 log_per_cap_GDP Income.composition.of.resources 
##                           1.113                           1.113
car::vif(disease_reduced)
##    Measles      Polio Diphtheria   HIV.AIDS 
##      1.005      1.620      1.627      1.020
car::vif(weight_reduced)
##                  BMI thinness..1.19.years 
##                1.434                1.434
car::vif(lifestyle_reduced)
##   Adult.Mortality under.five.deaths         Schooling        Population 
##             1.229             2.061             1.630             1.960 
##            Status 
##             1.371

All of these values look quite good. By eliminating highly correlated variables, we seem to have eliminated collinearity within these sub-models, though by performing transformations, we may possibly get even better values.

Transformed Models: Fixing the Equal Variance and Normality Assumptions

Currently our reduced models are better than the original ones, but they still have issues with the Equal Variance and Normality assumptions. By performing some transformations on the variables, however, we can improve on them significantly. (All transformations take into account that outlier data is not being removed.)

All variable transformations were chosen by looking at the pairs plot between Life.expectancy and the predictor in question, and by experimenting with different functions (log, exp, ^2, ^3, 1/x, etc.) to increase the Adjusted \(R^2\) where possible while keeping the size of the model relatively low.
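A minimal sketch of that kind of search, shown for Income.composition.of.resources (the candidate set is our own choice):

```r
# Compare adjusted R^2 across a few candidate polynomial expansions
candidates <- list(
  linear = Life.expectancy ~ Income.composition.of.resources,
  quad   = Life.expectancy ~ Income.composition.of.resources +
           I(Income.composition.of.resources^2),
  cubic  = Life.expectancy ~ Income.composition.of.resources +
           I(Income.composition.of.resources^2) +
           I(Income.composition.of.resources^3)
)
sapply(candidates, function(f) summary(lm(f, data = train_df))$adj.r.squared)
```

The transform with the best adjusted \(R^2\) per added predictor is kept.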

Money

money_trans <- lm(Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources + I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3), data = train_df)
s_money_trans <- summary(money_trans)
red_r2 <- s_money_reduced$adj.r.squared
trans_r2 <- s_money_trans$adj.r.squared
Model               Adj. \(R^2\)   Predictors
Reduced Model       0.5473         2
Transformed Model   0.7355         4

By adding 2 predictors, the transformed money model achieves a 34.3691% increase in Adjusted \(R^2\).

Disease

disease_trans <- lm(Life.expectancy ~ Measles + Polio + Diphtheria + HIV.AIDS + log(HIV.AIDS), data = train_df)
s_disease_trans <- summary(disease_trans)
red_r2 <- s_disease_reduced$adj.r.squared
trans_r2 <- s_disease_trans$adj.r.squared
Model               Adj. \(R^2\)   Predictors
Reduced Model       0.4412         4
Transformed Model   0.6682         5

With the transformed disease model, we achieve a 51.4742% increase in Adjusted \(R^2\) while increasing the size of the model by only 1 predictor. This is a huge improvement.

Weight

weight_trans <- lm(Life.expectancy ~ BMI + thinness..1.19.years + exp(thinness..1.19.years), data = train_df)
s_weight_trans <- summary(weight_trans)
red_r2 <- s_weight_reduced$adj.r.squared
trans_r2 <- s_weight_trans$adj.r.squared
Model               Adj. \(R^2\)   Predictors
Reduced Model       0.3524         2
Transformed Model   0.3638         3

The transformed weight model achieves a 3.2397% increase in Adjusted \(R^2\) by adding 1 predictor. The increase in Adjusted \(R^2\) is so small that it might be better to stick with the reduced model.

Lifestyle

lifestyle_trans <- lm(Life.expectancy ~ Adult.Mortality + under.five.deaths + Schooling + Population + Status + sqrt(Schooling) , data = train_df)
s_lifestyle_trans <- summary(lifestyle_trans)
red_r2 <- s_lifestyle_reduced$adj.r.squared
trans_r2 <- s_lifestyle_trans$adj.r.squared
Model               Adj. \(R^2\)   Predictors
Reduced Model       0.732          5
Transformed Model   0.7319         6

Here we can see there is a 0.0191% decrease in Adjusted \(R^2\) while increasing the size of the model by 1 predictor. It would definitely be better to stick with the reduced model in this case.

With the Adjusted \(R^2\) for all the models using transformed variables in hand, we have a decent idea of which reduced models we will replace with their transformed counterparts and which will stay the same. However, we should also check how our transformed models fare on the linear-model assumptions.

Assumptions Tests

Breusch-Pagan Test for Heteroscedasticity

library(lmtest)
bp_money_trans <- bptest(money_trans)$p.value
bp_disease_trans <-bptest(disease_trans)$p.value
bp_weight_trans <-bptest(weight_trans)$p.value
bp_lifestyle_trans <-bptest(lifestyle_trans)$p.value
Model       BP P-Val (Reduced)           BP P-Val (Transformed)
Money       \(2.2002 \times 10^{-83}\)   \(2.9621 \times 10^{-21}\)
Disease     \(1.9498 \times 10^{-12}\)   \(4.1784 \times 10^{-5}\)
Weight      \(2.1718 \times 10^{-38}\)   \(1.2527 \times 10^{-37}\)
Lifestyle   \(1.5688 \times 10^{-23}\)   \(7.343 \times 10^{-23}\)

We can see a large improvement for the money and disease models, and much smaller improvements for weight and lifestyle. Unfortunately, all the models still appear to fail the Equal Variance assumption, which we can confirm by looking at the new fitted-vs-residuals graphs.

par(mfrow = c(1,2))
#Reduced Money Model
plot(fitted(money_reduced), resid(money_reduced),  col = "darkgreen",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Reduced Money Model")
abline(h = 0, lwd = 3, col = "goldenrod")
#Transformed Money Model
plot(fitted(money_trans), resid(money_trans),  col = "darkgreen",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Transformed Money Model")
abline(h = 0, lwd = 3, col = "goldenrod")

par(mfrow = c(1,2))
#Reduced Disease Model
plot(fitted(disease_reduced), resid(disease_reduced),  col = "darkred",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Reduced Disease Model")
abline(h = 0, lwd = 3, col = "black")
#Transformed Disease Model
plot(fitted(disease_trans), resid(disease_trans),  col = "darkred",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Transformed Disease Model")
abline(h = 0, lwd = 3, col = "black")

par(mfrow = c(1,2))
#Reduced Weight Model
plot(fitted(weight_reduced), resid(weight_reduced),  col = "darkblue",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Reduced Weight Model")
abline(h = 0, lwd = 3, col = "purple")
#Transformed Weight Model
plot(fitted(weight_trans), resid(weight_trans),  col = "darkblue",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Transformed Weight Model")
abline(h = 0, lwd = 3, col = "purple")

par(mfrow = c(1,2))
#Reduced Lifestyle Model
plot(fitted(lifestyle_reduced), resid(lifestyle_reduced),  col = "cadetblue4",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Reduced Lifestyle Model")
abline(h = 0, lwd = 3, col = "brown4")
#Transformed Lifestyle Model
plot(fitted(lifestyle_trans), resid(lifestyle_trans),  col = "cadetblue4",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "Transformed Lifestyle Model")
abline(h = 0, lwd = 3, col = "brown4")

Observations:

Our graphs confirm the results of our BP tests. The money and disease models both improved greatly, while the weight and lifestyle models stayed essentially the same.

We also want to see whether there has been any improvement in the Normality assumption, so we shall run the Shapiro-Wilk test and compare.

Shapiro-Wilk Test

| Model     | Shapiro-Wilk p-value (Reduced) | Shapiro-Wilk p-value (Transformed) |
|-----------|--------------------------------|------------------------------------|
| Money     | \(1.4892 \times 10^{-22}\)     | \(7.6387 \times 10^{-15}\)         |
| Disease   | 0.0007                         | \(2.5637 \times 10^{-5}\)          |
| Weight    | \(1.7462 \times 10^{-8}\)      | \(1.7819 \times 10^{-8}\)          |
| Lifestyle | \(2.4965 \times 10^{-25}\)     | \(2.6413 \times 10^{-25}\)         |
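Each p-value of this kind comes from applying `shapiro.test()` to a model's residuals. As a minimal, self-contained sketch of the procedure, here is the same comparison on simulated stand-in data (not the WHO data, whose fitted models are not reproduced in this snippet):

```r
# Hedged sketch: Shapiro-Wilk p-values for a reduced vs a transformed model.
# The data here are simulated stand-ins, not the report's train_df.
set.seed(1)
x <- runif(150, 1, 10)
y <- exp(0.3 * x + rnorm(150, sd = 0.2))   # response with multiplicative noise

reduced     <- lm(y ~ x)                   # untransformed fit
transformed <- lm(log(y) ~ x)              # log-transformed fit

sw_p <- sapply(list(Reduced = reduced, Transformed = transformed),
               function(m) shapiro.test(resid(m))$p.value)
sw_p   # larger p-value = less evidence against normality of the residuals
```

For this simulated example the transformed fit's residuals are approximately normal, so its p-value is larger; in the report the comparison does not always go in that direction.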

These values are strange; it appears that the only model whose p-value improved was the money model. The weight and lifestyle models stayed relatively the same, while the disease model's p-value actually got worse. Let's look at the graphs while comparing them to the reduced models.

par(mfrow = c(1,2))
#Reduced Money Model
qqnorm(resid(money_reduced), main = "Reduced Money Model", col = "darkgreen")
qqline(resid(money_reduced))
#Transformed Money Model
qqnorm(resid(money_trans), main = "Transformed Money Model", col = "darkgreen")
qqline(resid(money_trans))

As expected, there is a marked improvement in the upper quantiles for the money model.

par(mfrow = c(1,2))
#Reduced Disease Model
qqnorm(resid(disease_reduced), main = "Reduced Disease Model", col = "darkred")
qqline(resid(disease_reduced))
#Transformed Disease Model
qqnorm(resid(disease_trans), main = "Transformed Disease Model", col = "darkred")
qqline(resid(disease_trans))

At first glance, there appears to be a slight deterioration in both the upper and lower quantiles of the new graph. However, considering the scale of the y-axis, the new model may actually be an improvement.

par(mfrow = c(1,2))
#Reduced Weight Model
qqnorm(resid(weight_reduced), main = "Reduced Weight Model", col = "darkblue")
qqline(resid(weight_reduced))
#Transformed Weight Model
qqnorm(resid(weight_trans), main = "Transformed Weight Model", col = "darkblue")
qqline(resid(weight_trans))

The weight graph appears essentially unchanged, which corresponds to the new SW p-value.

par(mfrow = c(1,2))
#Reduced Lifestyle Model
qqnorm(resid(lifestyle_reduced), main = "Reduced Lifestyle Model", col = "cadetblue4")
qqline(resid(lifestyle_reduced))
#Transformed Lifestyle Model
qqnorm(resid(lifestyle_trans), main = "Transformed Lifestyle Model", col = "cadetblue4")
qqline(resid(lifestyle_trans))

There is a very slight improvement in the upper quantiles of the lifestyle graph, while the lower quantiles remain the same.

ANOVA Test
Since all the Reduced Models are nested within the Transformed models, we can also run ANOVA tests to see if there is a significant improvement in the transformed models.

| Model     | Reduced vs Transformed p-value |
|-----------|--------------------------------|
| Money     | \(2.0333 \times 10^{-154}\)    |
| Disease   | \(4.4032 \times 10^{-151}\)    |
| Weight    | \(7.9201 \times 10^{-7}\)      |
| Lifestyle | 0.5753                         |
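A nested comparison of this kind can be run with R's `anova()` F-test. A minimal sketch on simulated data (the report's fitted model objects are not reproduced here):

```r
# Hedged sketch: F-test for a reduced model nested inside a larger model,
# using simulated data rather than the report's train_df.
set.seed(42)
x <- runif(200, 1, 10)
y <- 5 + 2 * log(x) + rnorm(200)

reduced <- lm(y ~ x)
larger  <- lm(y ~ x + log(x))   # reduced is nested inside larger

anova_tab <- anova(reduced, larger)
p_val <- anova_tab[2, "Pr(>F)"]
p_val   # a small p-value means the added term significantly improves the fit
```

In this toy setup the `log(x)` term reflects a real effect, so the F-test rejects the reduced model, mirroring the significant rows of the table above.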

Preferred Models out of the Reduced and Transformed Models
After comparing the results of these tests (ANOVA, SW test, BP test) and taking into account the change in the number of parameters and in Adjusted \(R^2\), we believe that we should continue forward with the following models:

  • Transformed Money Model
  • Transformed Disease Model
  • Reduced Weight Model
  • Reduced Lifestyle Model

Combining the Preferred Thematic Models

Thus far, we have identified promising variables in each of several thematic categories, along with a few useful transformations. The next step is to combine variables from all four categories into one larger model. There are a number of ways to do this; we have chosen to keep every variable we identified as "good" in the search scope, and to use bidirectional stepwise selection with both AIC and BIC.

Below, we employ the transformed money model as the starting point, though in principle any of the four selected models should work equally well.

AIC

starter <- money_trans

aic_model <- step(starter,
                       scope = Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources + I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) + Measles + Polio + Diphtheria + HIV.AIDS + log(HIV.AIDS) + BMI + thinness..1.19.years + Adult.Mortality + under.five.deaths + Schooling + Population + Status,
                       direction = "both",
                       trace = 0)

summary(aic_model)
## 
## Call:
## lm(formula = Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources + 
##     I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) + 
##     log(HIV.AIDS) + Adult.Mortality + HIV.AIDS + under.five.deaths + 
##     Population + BMI + Diphtheria, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.836  -1.697  -0.017   1.807  10.573 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           6.60e+01   7.06e-01   93.53  < 2e-16 ***
## log_per_cap_GDP                       6.80e-02   3.11e-02    2.18  0.02908 *  
## Income.composition.of.resources      -4.01e+01   4.48e+00   -8.94  < 2e-16 ***
## I(Income.composition.of.resources^2)  9.72e+01   1.11e+01    8.79  < 2e-16 ***
## I(Income.composition.of.resources^3) -4.17e+01   7.40e+00   -5.64  2.1e-08 ***
## log(HIV.AIDS)                        -1.17e+00   1.07e-01  -10.92  < 2e-16 ***
## Adult.Mortality                      -1.30e-02   9.61e-04  -13.52  < 2e-16 ***
## HIV.AIDS                             -2.15e-01   2.30e-02   -9.33  < 2e-16 ***
## under.five.deaths                    -2.64e-03   7.43e-04   -3.56  0.00039 ***
## Population                            3.90e-09   1.82e-09    2.15  0.03208 *  
## BMI                                   1.29e-02   5.73e-03    2.25  0.02484 *  
## Diphtheria                            7.83e-03   4.33e-03    1.81  0.07069 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.15 on 1307 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.874 
## F-statistic:  833 on 11 and 1307 DF,  p-value: <2e-16

BIC

n <- nrow(train_df)

bic_model <- step(starter, k = log(n),
                       scope = Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources + I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) + Measles + Polio + Diphtheria + HIV.AIDS + log(HIV.AIDS) + BMI + thinness..1.19.years + Adult.Mortality + under.five.deaths + Schooling + Population + Status,
                       direction = "both",
                       trace = 0)

summary(bic_model)
## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources + 
##     I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) + 
##     log(HIV.AIDS) + Adult.Mortality + HIV.AIDS + under.five.deaths, 
##     data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.944  -1.736  -0.031   1.812  10.911 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           66.555779   0.590117  112.78  < 2e-16 ***
## Income.composition.of.resources      -44.468727   4.346403  -10.23  < 2e-16 ***
## I(Income.composition.of.resources^2) 108.625382  10.685279   10.17  < 2e-16 ***
## I(Income.composition.of.resources^3) -48.185565   7.230695   -6.66  3.9e-11 ***
## log(HIV.AIDS)                         -1.203883   0.107116  -11.24  < 2e-16 ***
## Adult.Mortality                       -0.013152   0.000962  -13.67  < 2e-16 ***
## HIV.AIDS                              -0.206210   0.023001   -8.97  < 2e-16 ***
## under.five.deaths                     -0.002080   0.000525   -3.96  7.9e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.16 on 1311 degrees of freedom
## Multiple R-squared:  0.874,  Adjusted R-squared:  0.873 
## F-statistic: 1.29e+03 on 7 and 1311 DF,  p-value: <2e-16
not_in_BIC <- names(coef(aic_model))[!(names(coef(aic_model)) %in% names(coef(bic_model)))]

The AIC model is larger, containing all the same predictors as BIC as well as log_per_cap_GDP, Population, BMI, Diphtheria. This makes sense, as BIC penalizes larger models more heavily and tends to result in fewer predictors. Let's compare the models using an ANOVA test to see whether the AIC model is significantly better.
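The heavier BIC penalty can be seen directly in the criterion `step()` uses, `extractAIC()`, where `k = 2` gives AIC and `k = log(n)` gives BIC. A toy illustration on simulated data (not the report's `train_df`):

```r
# Hedged sketch: the same model scored with the AIC penalty (k = 2) and the
# BIC penalty (k = log(n)). Data are simulated, not the report's train_df.
set.seed(3)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + d$x1 + rnorm(50)
m <- lm(y ~ x1 + x2, data = d)

n <- nrow(d)
aic_score <- extractAIC(m, k = 2)[2]        # AIC: penalty of 2 per parameter
bic_score <- extractAIC(m, k = log(n))[2]   # BIC: penalty of log(50), about 3.9
c(AIC = aic_score, BIC = bic_score)
```

Because `log(n) > 2` for any `n > 7`, the BIC score is always larger for the same fit, so stepwise search under BIC drops marginal predictors sooner.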

ANOVA p-value: 0.0037

This p-value suggests that we should move forward with the AIC model. To confirm that it is the best model we have, we also compare it with the four category models we selected earlier, as well as a new full additive model that contains all the variables in our data frame (except the factor variables) along with any transformed variables from our smaller models.

full_model <- lm(Life.expectancy ~ . - GDP - Population + I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) + log(HIV.AIDS), data = train_df) #We are also removing the GDP and Population Params as we will use the log_per_cap_GDP instead
summary(full_model)
## 
## Call:
## lm(formula = Life.expectancy ~ . - GDP - Population + I(Income.composition.of.resources^2) + 
##     I(Income.composition.of.resources^3) + log(HIV.AIDS), data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.103  -1.726  -0.031   1.655  11.188 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           6.80e+01   1.14e+00   59.93  < 2e-16 ***
## StatusDeveloping                     -5.45e-01   3.49e-01   -1.56  0.11841    
## Adult.Mortality                      -1.17e-02   9.46e-04  -12.33  < 2e-16 ***
## infant.deaths                         4.68e-02   9.68e-03    4.84  1.5e-06 ***
## Alcohol                              -1.59e-01   3.24e-02   -4.90  1.1e-06 ***
## percentage.expenditure                2.28e-04   6.45e-05    3.53  0.00042 ***
## Hepatitis.B                          -5.25e-03   4.33e-03   -1.21  0.22507    
## Measles                              -2.78e-06   9.86e-06   -0.28  0.77820    
## BMI                                   8.13e-03   5.99e-03    1.36  0.17507    
## under.five.deaths                    -3.53e-02   7.07e-03   -4.99  6.9e-07 ***
## Polio                                -1.59e-03   4.96e-03   -0.32  0.74901    
## Total.expenditure                     1.75e-01   4.04e-02    4.33  1.6e-05 ***
## Diphtheria                            9.72e-03   5.62e-03    1.73  0.08427 .  
## HIV.AIDS                             -2.35e-01   2.30e-02  -10.23  < 2e-16 ***
## thinness..1.19.years                  5.53e-02   4.92e-02    1.12  0.26124    
## thinness.5.9.years                   -1.18e-01   4.87e-02   -2.43  0.01506 *  
## Income.composition.of.resources      -5.57e+01   5.21e+00  -10.70  < 2e-16 ***
## Schooling                            -1.10e-01   7.39e-02   -1.49  0.13724    
## per_cap_GDP                           3.02e-02   2.28e-02    1.33  0.18529    
## log_per_cap_GDP                       1.12e-02   3.12e-02    0.36  0.72037    
## I(Income.composition.of.resources^2)  1.38e+02   1.33e+01   10.42  < 2e-16 ***
## I(Income.composition.of.resources^3) -6.75e+01   8.88e+00   -7.60  5.8e-14 ***
## log(HIV.AIDS)                        -9.20e-01   1.11e-01   -8.32  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.06 on 1296 degrees of freedom
## Multiple R-squared:  0.883,  Adjusted R-squared:  0.881 
## F-statistic:  446 on 22 and 1296 DF,  p-value: <2e-16
full_r2 <- summary(full_model)$adj.r.squared
aic_r2 <- summary(aic_model)$adj.r.squared
#This function will be used to calculate the RMSE of our models
rmse = function(model, training = TRUE){
  if(training){
    y = train_df$Life.expectancy
    y_hat = predict(model,newdata = train_df)
    n = nrow(train_df)
  } else{ 
    y = test_df$Life.expectancy
    y_hat = predict(model,newdata = test_df)
    n = nrow(test_df)
  }
  sqrt(sum((y - y_hat) ^ 2) / n)
}
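As a usage sketch of an RMSE helper of this shape, here is the same train/test comparison on a simulated split (toy data, not the WHO data; the helper is restated locally so the snippet is self-contained):

```r
# Hedged sketch: train vs test RMSE with a helper analogous to rmse() above,
# computed on simulated stand-in data frames.
set.seed(7)
toy <- data.frame(x = runif(100))
toy$y <- 3 + 2 * toy$x + rnorm(100, sd = 0.1)
train_toy <- toy[1:80, ]
test_toy  <- toy[81:100, ]

m <- lm(y ~ x, data = train_toy)

rmse_toy <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

train_rmse <- rmse_toy(m, train_toy)
test_rmse  <- rmse_toy(m, test_toy)
c(train = train_rmse, test = test_rmse)   # both near the noise sd of 0.1
```

A test RMSE far above the train RMSE would signal overfitting; here the model is correctly specified, so the two are close, as they also turn out to be for our candidate models below.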
| AIC Model vs        | ANOVA p-value               |
|---------------------|-----------------------------|
| Transformed Money   | \(4.4502 \times 10^{-207}\) |
| Transformed Disease | \(5.1687 \times 10^{-272}\) |
| Reduced Weight      | 0                           |
| Reduced Lifestyle   | \(1.517 \times 10^{-211}\)  |
| Full                | \(4.3248 \times 10^{-14}\)  |

According to these results, our AIC model is better than all of the models except for the Full model. However, when we look at the adjusted R-squared values:

| Model | Adjusted \(R^2\) | No. of Predictors |
|-------|------------------|-------------------|
| AIC   | 0.8741           | 11                |
| Full  | 0.8813           | 22                |

we can see that Adjusted \(R^2\) increases by only 0.8272% while the model size increases by 11 predictors. It is better to stick with the AIC model. Let's now confirm these results using test data.

Comparing Train and Test Data Performance

So far we have not considered the possibility of overfitting. It therefore makes sense to validate our preliminary conclusions on a held-out test dataset to see how these models behave.

| Model      | RMSE (train) | RMSE (test) | No. of Predictors |
|------------|--------------|-------------|-------------------|
| AIC Model  | 3.1354       | 3.1677      | 11                |
| Full Model | 3.0312       | 3.0279      | 22                |

Looking at the RMSE, we can see that the full model has a lower RMSE on both the train and test data, but it also has almost twice as many predictors. Since smaller models are generally preferable and the improvement is small, we select the AIC-derived combined model as our final choice.

Results

bp_aic <- bptest(aic_model)$p.value
sw_aic <- shapiro.test(resid(aic_model))$p.value
adj_r2 <- summary(aic_model)$adj.r.squared
summary(aic_model)
## 
## Call:
## lm(formula = Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources + 
##     I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) + 
##     log(HIV.AIDS) + Adult.Mortality + HIV.AIDS + under.five.deaths + 
##     Population + BMI + Diphtheria, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.836  -1.697  -0.017   1.807  10.573 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           6.60e+01   7.06e-01   93.53  < 2e-16 ***
## log_per_cap_GDP                       6.80e-02   3.11e-02    2.18  0.02908 *  
## Income.composition.of.resources      -4.01e+01   4.48e+00   -8.94  < 2e-16 ***
## I(Income.composition.of.resources^2)  9.72e+01   1.11e+01    8.79  < 2e-16 ***
## I(Income.composition.of.resources^3) -4.17e+01   7.40e+00   -5.64  2.1e-08 ***
## log(HIV.AIDS)                        -1.17e+00   1.07e-01  -10.92  < 2e-16 ***
## Adult.Mortality                      -1.30e-02   9.61e-04  -13.52  < 2e-16 ***
## HIV.AIDS                             -2.15e-01   2.30e-02   -9.33  < 2e-16 ***
## under.five.deaths                    -2.64e-03   7.43e-04   -3.56  0.00039 ***
## Population                            3.90e-09   1.82e-09    2.15  0.03208 *  
## BMI                                   1.29e-02   5.73e-03    2.25  0.02484 *  
## Diphtheria                            7.83e-03   4.33e-03    1.81  0.07069 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.15 on 1307 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.874 
## F-statistic:  833 on 11 and 1307 DF,  p-value: <2e-16

Our final model has the following properties:

| Property          | Value                      |
|-------------------|----------------------------|
| No. of Predictors | 11                         |
| BP-Test p-value   | \(9.8854 \times 10^{-6}\)  |
| SW-Test p-value   | \(7.1508 \times 10^{-11}\) |
| Adjusted \(R^2\)  | 0.8741                     |

plot(fitted(aic_model), resid(aic_model),  col = "purple",
     ylab = "Residuals",
     xlab = "Fitted Values",
     main = "AIC-Derived Combined Model")
abline(h = 0, lwd = 3, col = "darkblue")

qqnorm(resid(aic_model), main = "AIC-Derived Combined Model", col = "purple")
qqline(resid(aic_model))

Discussion

Our final model has a high \(R^2\) value, which is good, but it still has some issues with the normality assumption, especially in the lower quantiles. This makes sense because, in the initial pair plots, many of the graphs had clusters of data points in the lower ranges. Removing these outliers could have improved our model's BP and SW test results. However, without understanding why those data points were so separate from the rest of the data, we deemed it unwise to remove them and decided to work with them instead.

We also noticed that many pair plots seemed to point to complex collinear relationships between pairs of variables, where more than one linear relationship could be identified simultaneously. We briefly tried to identify possible interactions, but failed to find a variable that might explain such behavior.

Overall, breaking a large number of variables into thematic categories proved useful in quickly eliminating collinear variables that had equivalent effects on life expectancy. We were able to combine these models into a larger model that took all categories of variables into account. Ultimately, however, less intuitive but more rewarding transformations discovered through trial and error were needed to produce a better predictive model.

Appendix

Group Members:
+ Warren Child
+ Zoheb Satta
+ Yoga Mahalingam